STATS 32 Session 2: Packages and Data Frames

Kenneth Tay

Oct 4, 2018

Gear Up for Social Science Data Extravaganza

Oct 19 (Fri), Graduate School of Business Library & Green Library
Various activities from 10am - 5pm
Data talks & training, data expo, data analysis demos
More information here: https://events.stanford.edu/events/805/80540/

Recap of session 1

R as calculator
Types of variables in R
- Numeric (integer, double)
- Character/string
- Boolean, called "logical" in R (TRUE or FALSE)
Homogeneous data structures
- Vectors: 1D array, elements of same type
- Matrices: 2D array, elements of same type
Heterogeneous data structures
- Lists: Collection of key-value pairs, value can be anything
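
As a quick refresher, the recap above can be sketched in a few lines. The variable names (`v`, `m`, `l`) and the values are made up for illustration only.

```r
# Basic variable types
typeof(3.5)      # "double"
typeof(2L)       # "integer" (the L suffix asks for an integer)
typeof("abc")    # "character"
typeof(TRUE)     # "logical"

# Homogeneous data structures: every element has the same type
v <- c(1.5, 2, 3)             # vector (1D)
m <- matrix(1:6, nrow = 2)    # matrix (2D), here 2 rows by 3 columns
dim(m)

# Heterogeneous data structure: values can be of different types
l <- list(name = "Ann", scores = c(90, 85), passed = TRUE)
str(l)
```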

Vectors

Can be created using the c() function, or using the : shortcut
Elements in a vector have to be of the same type

vec <- c(10, 5, 20)

vec <- 1:10 * 2
vec

##  [1]  2  4  6  8 10 12 14 16 18 20

To extract a subset of elements by their indices, put a vector of indices in square brackets:

vec[c(1,5)]

## [1]  2 10
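
A short sketch of two related points: what happens when types are mixed in `c()`, and two other common ways to subset a vector. The vector `mixed` is a made-up example.

```r
# Mixing types in c(): R converts everything to one common type
mixed <- c(1, "a", TRUE)
mixed            # every element is now a character string
typeof(mixed)    # "character"

vec <- 1:10 * 2

# Subset with a logical condition: keep elements greater than 10
vec[vec > 10]

# Subset with a negative index: drop the first element
vec[-1]
```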

Lists

A collection of key-value pairs
Created with the list() function

cars <- list(make = "Honda", 
             models = c("Fit", "CR-V", "Odyssey"), 
             available = c(TRUE, TRUE, TRUE))

Use [[ or $ notation to refer to a specific key-value pair

cars$make

## [1] "Honda"

cars[["models"]]

## [1] "Fit"     "CR-V"    "Odyssey"
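
The sketch below contrasts single and double brackets on the same `cars` list, and adds a new key-value pair. The `year` key and its value are hypothetical, added only for illustration.

```r
cars <- list(make = "Honda",
             models = c("Fit", "CR-V", "Odyssey"),
             available = c(TRUE, TRUE, TRUE))

names(cars)                # the keys of the list

# Single brackets return a (smaller) list
sub <- cars["models"]
class(sub)                 # "list"

# Double brackets return the value itself, which we can index further
cars[["models"]][2]        # "CR-V"

# Assigning to a new key adds a key-value pair (hypothetical key)
cars$year <- 2018
length(cars)               # 4
```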

Agenda for today

Data frames
3 types of syntax in R
Functions and packages

What is a data frame?

A data frame is R’s data structure for storing datasets
- First row: variable/covariate/feature names
- Each subsequent row represents one observation
- Each column contains the values of that variable across observations
Tibbles: A data frame with some cosmetic changes

Example of a dataset

R’s syntax for creating data frames

df <- data.frame(votes_dem = c(486351, 318, 5904), 
                 votes_gop = c(91189, 211, 10239))
df

##   votes_dem votes_gop
## 1    486351     91189
## 2       318       211
## 3      5904     10239
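
Once a data frame exists, a few base R functions help to inspect it. This is a minimal sketch using the same `df` as above; the row filter at the end is just one illustrative query.

```r
df <- data.frame(votes_dem = c(486351, 318, 5904),
                 votes_gop = c(91189, 211, 10239))

dim(df)      # number of rows, number of columns
names(df)    # variable names
str(df)      # compact summary of each column

# Rows and columns can be selected with [row, column]
df[2, ]                               # second observation
df[df$votes_dem > df$votes_gop, ]     # observations where votes_dem is larger
```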

Data frames “under the hood”

To R, a data frame is simply a special type of list!
- Keys of the list are the variable/covariate names
- Values are vectors of the same length
Because they are special, data frames have some additional functionality

is.list(df)

## [1] TRUE

df$votes_dem

## [1] 486351    318   5904
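
Since a data frame is a list, the list tools from Session 1 carry over. A small sketch, where the `total` column is a hypothetical addition for illustration:

```r
df <- data.frame(votes_dem = c(486351, 318, 5904),
                 votes_gop = c(91189, 211, 10239))

length(df)           # 2: the number of keys, i.e. columns
df[["votes_gop"]]    # [[ works just like it does for lists

# Adding a key-value pair adds a column (hypothetical column name)
df$total <- df$votes_dem + df$votes_gop

# Each value is a vector of the same length
sapply(df, length)
```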

R’s syntax

3 different types of syntax:

Function syntax
+ syntax for plotting with ggplot2 (Session 3)
%>% syntax for transforming data with dplyr (Session 4)

Functions: R’s workhorse

A function is a named block of code which

Takes in 1 or more inputs from the user,
Performs a specific task, and
Returns an output to the user.
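
To make the three parts concrete, here is a minimal sketch of a user-defined function. The name `f_to_c` and its argument `temp_f` are hypothetical, chosen only for this example.

```r
# Input: a temperature (or vector of temperatures) in Fahrenheit
# Task: convert to Celsius
# Output: the converted value(s)
f_to_c <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

f_to_c(212)          # 100
f_to_c(c(32, 50))    # 0 10
```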

(Source: practicalli.github.io)

(Source: codehs.gitbooks.io)

We use functions in R all the time

We’ve already seen a number of functions in R! For example,

is.character("123")

## [1] TRUE

The function is.character takes the input given to it in the parentheses and returns TRUE or FALSE, depending on whether the input is of type character or not.

Others we’ve seen: str, log, typeof, rm, c, list, length, …

We can see what a function does by typing in ? followed by the function name in the R console.

?is.character

Structure of an R function call

A function call consists of:

Function name
Parentheses, and
A list of arguments within the parentheses
- Options that change what the function does slightly
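
A short sketch of how arguments change a function's behavior, using two base R functions (`round` and `paste`) with and without an optional argument:

```r
# round(): function name, parentheses, one argument
round(3.14159)                # 3

# The digits argument changes how many decimal places are kept
round(3.14159, digits = 2)    # 3.14

# paste(): the sep argument changes the separator between the inputs
paste("a", "b")               # "a b"
paste("a", "b", sep = "-")    # "a-b"
```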

`mean()`: An example

Take the mean of c(1,3,NA).

mean(c(1,3,NA))

## [1] NA

mean(c(1,3,NA), na.rm = TRUE)

## [1] 2

`sample()`: Description

`sample()`: Usage

What comes after the = sign: default value for that argument
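
The defaults can also be checked from the console rather than the help page. A small sketch using the base R functions `args()` and `formals()`:

```r
# Print the argument list of sample(), including defaults
args(sample)
# function (x, size, replace = FALSE, prob = NULL)

# The names of the arguments, in order
names(formals(sample))

# The default value of one particular argument
formals(sample)$replace    # FALSE
```

Arguments without an `=` (here `x` and `size`) have no default written in the signature.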

`sample()`: Arguments

`sample()`: Details

`sample()`: Value

How does R know which arguments we are referring to?

sample(x = 1:10, size = 10)

##  [1]  4  9  7  6 10  1  8  5  3  2

sample(1:10, 10, TRUE)

##  [1]  9  2  2  2 10  1  2  4  8  3

sample(size = 5, 1:10)

## [1] 5 1 3 4 2
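
R first matches arguments that are given by name, then fills the remaining ones by position. The sketch below checks this by fixing the random seed so that equivalent calls return identical samples. The seed value and the names `r1`, `r2`, `r3` are arbitrary.

```r
# All arguments named
set.seed(1)
r1 <- sample(x = 1:10, size = 5)

# No names: matched by position (x first, size second)
set.seed(1)
r2 <- sample(1:10, 5)

# size is matched by name; 1:10 fills the first remaining slot, x
set.seed(1)
r3 <- sample(size = 5, 1:10)

identical(r1, r2)    # TRUE
identical(r1, r3)    # TRUE
```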

Functions can be “chained” together

Commands are evaluated “from inside out”

is.character(as.character(123))

## [1] TRUE
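
A slightly longer chain, as a sketch, together with the same computation written one step at a time. The vector `x` and the intermediate names are made up for illustration.

```r
x <- c(1, 4, 9, 16)

# Evaluated from the inside out: sum(x) first, then sqrt(), then round()
round(sqrt(sum(x)), 1)    # 5.5

# The same steps, one at a time
total <- sum(x)           # 30
root  <- sqrt(total)      # about 5.477
round(root, 1)            # 5.5
```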

Packages

Many people around the world are trying to do the same thing, so why not share code?
A package is a collection of R functions, data, documentation and tests
Hadley Wickham: “the fundamental unit of shareable code”
R comes with a core set of built-in packages
- E.g. base, datasets, graphics, stats
Most user-created packages are available on The Comprehensive R Archive Network (CRAN)
- Packages are generally well-maintained and have good documentation
- As of 3 Oct 2018, there are 13,122 packages available on CRAN
For packages related to bioinformatics, see Bioconductor
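
A minimal sketch of the usual workflow for packages. The install and load lines for `fueleconomy` are commented out because installing needs an internet connection; the runnable lines use the built-in `stats` package instead.

```r
# Install a package from CRAN (needed only once per computer)
# install.packages("fueleconomy")

# Load an installed package (needed in every new R session)
# library(fueleconomy)

# Built-in packages can be loaded the same way
library(stats)

# pkg::function calls a function from a specific package
stats::median(c(3, 1, 2))    # 2

# Check whether a package is currently loaded
"stats" %in% loadedNamespaces()
```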

Today’s dataset: Fuel economy

(Source: SuperCars)

`fueleconomy`: Package information on CRAN

https://cran.r-project.org/web/packages/fueleconomy/index.html

Optional material

Why use functions?

Reason #1: Functions make code more understandable.

Both for others and for you (6 months down the road)

Example: What is the line of code below trying to do?

x <- c(4, 234, 1, 50, 764)
x <- (x - min(x)) / (max(x) - min(x))
x

## [1] 0.003931848 0.305373526 0.000000000 0.064220183 1.000000000

rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
rescale01(c(4, 234, 1, 50, 764))

## [1] 0.003931848 0.305373526 0.000000000 0.064220183 1.000000000

Why use functions?

Reason #2: Functions make code more concise.

DRY principle: Don’t Repeat Yourself
Minimize chance of making errors

list$a <- (list$a - min(list$a)) / (max(list$a) - min(list$a))
list$b <- (list$b - min(list$b)) / (max(list$b) - min(list$b))
list$b <- (list$c - min(list$c)) / (max(list$c) - min(list$c))

vs.

list$a <- rescale01(list$a)
list$b <- rescale01(list$b)
list$c <- rescale01(list$c)

Can you spot the mistake in the first block?

Why use functions?

Reason #3: Functions enable code reuse and code changes.

list$a <- (list$a - min(list$a)) / (max(list$a) - min(list$a))
list$b <- (list$b - min(list$b)) / (max(list$b) - min(list$b))
list$c <- (list$c - min(list$c)) / (max(list$c) - min(list$c))

vs.

rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
list$a <- rescale01(list$a)
list$b <- rescale01(list$b)
list$c <- rescale01(list$c)

What if I want to rescale the entries to be between 0 and 2 instead?
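
One possible answer, as a sketch: with a function, the change is made in exactly one place. Here the hypothetical function `rescale` takes an extra argument `upper` (default 1), and `vals` is a made-up list standing in for the one on the slide.

```r
# Rescale x to lie between 0 and upper
rescale <- function(x, upper = 1) {
  upper * (x - min(x)) / (max(x) - min(x))
}

vals <- list(a = c(1, 2, 3), b = c(10, 20, 40), c = c(5, 5.5, 6))

vals$a <- rescale(vals$a, upper = 2)
vals$b <- rescale(vals$b, upper = 2)
vals$c <- rescale(vals$c, upper = 2)

vals$a    # 0 1 2
```

Without the function, the same change would have to be typed into each of the three lines separately.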

List of useful packages

dplyr: Transform data
ggplot2: Make nice plots
readr: Import data into R
tidyr: Clean data
stringr: Tools for working with character strings and regular expressions
lubridate: Make working with dates and times easier
caret: Tools for training regression and classification models
glmnet: Advanced regression methods
maps, ggmap: Tools for plotting spatial data
shiny: Make interactive web apps

Measures of central tendency

Mean: sum of all values divided by the number of values
Mode: most commonly occurring value
$x$th percentile: value such that $x$% of the values fall below it
- Median: 50th percentile
- 1st quartile: 25th percentile
- 3rd quartile: 75th percentile
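
These measures can be computed in base R, as sketched below on a made-up vector `x`. Base R has no built-in function for the statistical mode (`mode()` reports the storage type instead), so the sketch uses a frequency table. Note also that `quantile()` supports several definitions of a percentile, so its results can differ slightly from a hand calculation.

```r
x <- c(2, 3, 3, 5, 7, 10)

mean(x)      # 5
median(x)    # 4

# 1st and 3rd quartiles (25th and 75th percentiles)
quantile(x, c(0.25, 0.75))

# Mode: tabulate the values and pick the most frequent one
tab <- table(x)
as.numeric(names(tab)[which.max(tab)])    # 3
```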

Measures of spread

Variance: average squared deviation from the mean
Standard deviation: square root of variance
Interquartile range: 3rd quartile - 1st quartile
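
A sketch of the corresponding base R functions on a made-up vector `x`. One detail worth knowing: `var()` and `sd()` divide by n - 1 (the sample versions), not by n, so `var()` differs slightly from a plain average of squared deviations.

```r
x <- c(2, 3, 3, 5, 7, 10)

var(x)    # sample variance: sum of squared deviations divided by n - 1
sd(x)     # square root of var(x)
IQR(x)    # 3rd quartile minus 1st quartile

# Average squared deviation, dividing by n instead of n - 1
mean((x - mean(x))^2)
```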
